SQUAD OF FIFA 2018 by SAHIL CHUTANI

Introduction

Where do most of the players in FIFA 2018 come from? Is it South America or Europe? What is the most common age of the players listed in FIFA 2018? What is the age range of players? How is their performance distributed? These are the questions I would like to answer through Exploratory Data Analysis. I will use the ggplot2 library that I learnt in the lesson, together with plotly for interactive visualization.

Dataset

The dataset features every player in FIFA 2018 with 70+ attributes. It contains personal attributes like Nationality, Photo, Club, Age, Wage, Salary etc. I downloaded the dataset from https://www.kaggle.com/thec03u5/fifa-18-demo-player-dataset.

The dataset is tidy except for a few columns: Wage, Value and Preferred.Positions. I extract the numeric values from the Wage and Value columns, and pull out the most preferred position from the Preferred.Positions column, assuming the positions are listed in order of preference.
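The cleaning steps described above can be sketched roughly as follows. This is a minimal sketch, not the code used for this report: the helper names `parse_money_k` and `first_position` are hypothetical, and it assumes the raw strings look like a currency symbol followed by a number and a K or M suffix, and that positions are separated by spaces.

```r
# Hypothetical helper (not from the original analysis): convert strings such as
# "<symbol>123M" or "<symbol>625K" to a number expressed in thousands.
# Strings without a K or M suffix become NA in this sketch.
parse_money_k <- function(x) {
  number <- as.numeric(gsub("[^0-9.]", "", x))  # keep digits and decimal point
  multiplier <- ifelse(grepl("M$", x), 1000,
                       ifelse(grepl("K$", x), 1, NA_real_))
  number * multiplier
}

# Hypothetical helper: take the first entry of a space-separated position list,
# on the assumption that the first entry is the most preferred one.
first_position <- function(x) {
  vapply(strsplit(trimws(x), "\\s+"), function(p) p[1], character(1))
}

# Made-up example strings in the assumed format.
value_k <- parse_money_k(c("\u20ac123M", "\u20ac625K", "\u20ac1.6M"))
positions <- first_position(c("ST LW ", "CB", "RWB RB "))
print(value_k)
print(positions)
```

Keeping everything in thousands means Wage and Value end up on the same scale as the summaries shown later in this report.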

Summary of FIFA 2018

##                   Name            Age       
##  J. Rodríguez       :    7   Min.   :16.00  
##  J. Valencia        :    7   1st Qu.:21.00  
##  J. Williams        :    7   Median :25.00  
##  D. González        :    6   Mean   :25.14  
##  Danilo             :    6   3rd Qu.:28.00  
##  Felipe             :    6   Max.   :47.00  
##  (Other)            :17942                  
##                                              Photo      
##  https://cdn.sofifa.org/48/18/players/197083.png:    2  
##  https://cdn.sofifa.org/48/18/players/198113.png:    2  
##  https://cdn.sofifa.org/48/18/players/198140.png:    2  
##  https://cdn.sofifa.org/48/18/players/198329.png:    2  
##  https://cdn.sofifa.org/48/18/players/198584.png:    2  
##  https://cdn.sofifa.org/48/18/players/198614.png:    2  
##  (Other)                                        :17969  
##  Nationality           Overall        Potential    
##  Length:17981       Min.   :46.00   Min.   :46.00  
##  Class :character   1st Qu.:62.00   1st Qu.:67.00  
##  Mode  :character   Median :66.00   Median :71.00  
##                     Mean   :66.25   Mean   :71.19  
##                     3rd Qu.:71.00   3rd Qu.:75.00  
##                     Max.   :94.00   Max.   :94.00  
##                                                    
##                 Club          Value               Wage          
##                   :  248   Length:17981       Length:17981      
##  Villarreal CF    :   35   Class :character   Class :character  
##  Borussia Dortmund:   34   Mode  :character   Mode  :character  
##  FC Nantes        :   34                                        
##  Manchester United:   34                                        
##  OGC Nice         :   34                                        
##  (Other)          :17562                                        
##  Preferred.Positions  Continent        
##  Length:17981        Length:17981      
##  Class :character    Class :character  
##  Mode  :character    Mode  :character  
##                                        
##                                        
##                                        
## 

Observations

  1. Age ranges from 16 to 47 years, with a mean of 25.14 and a median of 25. I suspect Age is normally distributed. I will plot a histogram in the univariate plots section to see if this is the case.

  2. Looking at the Nationality column, the top 5 countries are all from either Europe or South America. In the univariate plots section I will group the data by Nationality and plot it on a map to visualize the distribution of players by country.

  3. The Overall and Potential columns both range from 46 to 94, with means of 66 and 71 respectively. The 5 point difference in means makes me wonder how many players have scope for improvement. I would like to explore the difference between the two columns in the plots section below. I expect these two columns to be heavily correlated.

Univariate Plots Section


Indeed, the age distribution looks roughly normal. 1522 players are aged 25, and most of the players are clustered around that age. I expected this.
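A count like the one quoted above can be reproduced with a simple frequency table. The sketch below uses a small made-up vector of ages rather than the real Age column, so the numbers it prints are illustrative only.

```r
# Hypothetical sample of ages; the real column has 17,981 values.
age <- c(21, 25, 25, 28, 25, 19, 31, 25, 23, 27)

# Frequency of each age, then the most common age and its count.
age_counts <- table(age)
modal_age <- as.integer(names(age_counts)[which.max(age_counts)])
players_at_mode <- as.integer(max(age_counts))

# Mean and median close together are consistent with a symmetric shape.
age_mean <- mean(age)
age_median <- median(age)

# One-year bins, computed without drawing; plot(age_hist) would draw them.
age_hist <- hist(age, breaks = seq(min(age) - 0.5, max(age) + 0.5, by = 1),
                 plot = FALSE)

print(c(modal_age = modal_age, players_at_mode = players_at_mode))
print(c(mean = age_mean, median = age_median))
```

One-year bins are used so that each bar corresponds to exactly one age, which makes the peak directly comparable to the frequency table.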


No surprises here either: the most common Overall score is 66.



The plot shows that the most common Potential score is 70, which is 1 point below the mean Potential score.

Most players are already at their best. I observe that some players have the potential to score more than 10 points above their current Overall. I wonder which countries they belong to. I will explore this further when I visualize the distributions on a world map.
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   10.00   14.00   21.00   31.95   36.00  565.00   12727


The Wage variable has a lot of NAs: 12,727 of the 17,981 values, about 71%. I will discard this variable from any further analysis.
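The decision to drop a variable can be made explicit by computing the share of missing values per column. This is a minimal sketch on made-up data; the 0.5 cut-off (`max_na_share`) is a hypothetical threshold, not one used in this report.

```r
# Made-up columns standing in for the parsed Wage and Value variables.
players <- data.frame(
  Wage  = c(10, NA, 21, NA, NA, 36, NA, 565, NA, NA),
  Value = c(300, 625, NA, 1600, 10, 123000, 900, 450, 700, 2200)
)

# Share of missing values in each column.
na_share <- colMeans(is.na(players))

# Hypothetical rule: keep only columns with at most half their values missing.
max_na_share <- 0.5
kept <- players[, na_share <= max_na_share, drop = FALSE]

# The share implied by the Wage summary above: 12727 NAs out of 17981 rows.
reported_share <- 12727 / 17981

print(na_share)
print(names(kept))
print(round(reported_share, 2))
```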


##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##      10     300     625    2252    1600  123000    1548


The Value variable is intriguing. The median value is 625K, meaning half the players are valued below 625K and half above. The 3rd quartile is 1.6M and the maximum value is 123M. I expected this, because most players are not valued in the millions, but I would like to explore the highly valued players further.


## # A tibble: 6 x 11
##   Nationality    mean_Overall max_Overall mean_Potential max_Potential
##   <chr>                 <dbl>       <dbl>          <dbl>         <dbl>
## 1 United Kingdom         63.1        89.0           69.9          90.0
## 2 Germany                65.9        92.0           71.6          92.0
## 3 Spain                  69.9        90.0           74.8          92.0
## 4 France                 67.3        88.0           73.0          94.0
## 5 Argentina              67.8        93.0           72.5          93.0
## 6 Brazil                 70.9        92.0           72.9          94.0
## # ... with 6 more variables: mean_Age <dbl>, mean_Diff <dbl>,
## #   max_Diff <dbl>, mean_Value <dbl>, max_Value <dbl>, n <int>


Clearly the reddest regions are in South America and Europe. The UK has the highest number of players. Most of Asia and Africa is grey, meaning fewer than 60 players are from these regions. In the Middle East there is a stark contrast between nations: Saudi Arabia is much redder than the others. The surprising observations are Canada and New Zealand. Both are high-income countries but are grey, so perhaps population affects the number of players from a country.
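A per-country summary like the one feeding the map can be built with a group-by operation. The sketch below uses base R and a tiny made-up data frame; the column names follow the summary table above, but the values are not from the dataset.

```r
# Made-up rows; only the column names mirror the real data.
players <- data.frame(
  Nationality = c("Brazil", "Brazil", "Spain", "Spain", "Spain", "Syria"),
  Overall     = c(70, 72, 68, 71, 71, 75),
  Potential   = c(72, 74, 74, 75, 76, 75),
  stringsAsFactors = FALSE
)

# Mean Overall and Potential per country.
by_nation <- aggregate(cbind(Overall, Potential) ~ Nationality,
                       data = players, FUN = mean)
names(by_nation)[2:3] <- c("mean_Overall", "mean_Potential")

# Number of players per country, matched by country name.
counts <- table(players$Nationality)
by_nation$n <- as.integer(counts[as.character(by_nation$Nationality)])

print(by_nation)
```

The `n` column is what a player-count map would be coloured by, while the mean columns feed the score maps.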



Center Back is the most preferred position and Right Wing Back is the least preferred position. I wonder if preference has an impact on value.

Univariate Analysis

After exploring the variables in the dataset, I have the following conclusions:

  1. Most of the players in FIFA 2018 belong to Europe or South America. UK has the highest number of players.
  2. Most players have already attained their potential as their overall is equal to potential.
  3. Center Back (CB) is the most preferred position amongst players and Right Wing Back (RWB) is the least preferred position.

What is the structure of your dataset?

The dataset is tidy. Apart from a few changes like extracting numbers from a variable, I don’t need to make any more changes.

What is/are the main feature(s) of interest in your dataset?

The most interesting features in the dataset are Nationality, Age, Potential, Overall, Value, and Preferred.Positions. A brief description of the features is as follows:

  1. Nationality: Nationality of the player.
  2. Age: Age of the player.
  3. Potential: The potential of the player.
  4. Overall: The current overall standing of the player.
  5. Value: The player's value in thousands of pounds.
  6. Preferred.Positions: The preferred position of the player.

What other features in the dataset do you think will help support your
investigation into your feature(s) of interest?

Other features like Continent might be helpful. I will explore whether they are.

Did you create any new variables from existing variables in the dataset?

I created a variable PO_Diff, which is the difference between Potential and Overall. I also created a variable Continent.
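As a minimal sketch, PO_Diff can be derived and summarized as below. The data frame is made up for illustration, and the 10 point cut-off simply mirrors the "more than 10 points" observation made earlier.

```r
# Made-up scores; only the column names mirror the real data.
players <- data.frame(
  Overall   = c(94, 66, 60, 71, 58),
  Potential = c(94, 66, 72, 75, 70)
)

# Difference between what a player could reach and where they are now.
players$PO_Diff <- players$Potential - players$Overall

# Share of players already at their potential.
share_at_potential <- mean(players$PO_Diff == 0)

# Players with more than 10 points of room to improve.
big_upside <- players[players$PO_Diff > 10, ]

print(share_at_potential)
print(big_upside)
```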

Of the features you investigated, were there any unusual distributions?
Did you perform any operations on the data to tidy, adjust, or change the form
of the data? If so, why did you do this?

I extracted the numerical values from the Wage and Value variables. I also pulled out the most preferred position from the Preferred.Positions variable.

Bivariate Plots Section


Most regions of the world seem uniform when it comes to Age Distribution with the exception of nations in Africa. There are subtle differences though.

Looking at the overall score across countries, one would think that Mozambique, Oman, and Syria are amongst the countries supplying the best players in the world. This looks suspicious, and it should, because these countries do not even have a double-digit number of players. Syria has only 1; compare this with Brazil, which has 812 players and a mean overall score of 70.9. I will plot the map once again, using only nations that have at least 200 players in FIFA 2018.
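Filtering out countries with too few players before comparing means can be sketched as follows. The data are made up, and `min_players` is set to 2 only so that the toy example filters something; the report itself uses a threshold of 200.

```r
# Made-up rows; a single high-scoring player makes "Syria" look strongest.
players <- data.frame(
  Nationality = c("Brazil", "Brazil", "Spain", "Spain", "Spain", "Syria"),
  Overall     = c(70, 72, 68, 71, 71, 75),
  stringsAsFactors = FALSE
)

# Hypothetical threshold for this toy data (the report uses 200).
min_players <- 2

# Keep only countries with enough players for the mean to be meaningful.
counts <- table(players$Nationality)
big_nations <- names(counts)[counts >= min_players]
subset_players <- players[players$Nationality %in% big_nations, ]

# Mean Overall per remaining country.
mean_overall <- tapply(subset_players$Overall, subset_players$Nationality, mean)

print(mean_overall)
```

Without the filter, the single made-up "Syria" row would top the ranking, which is the small-sample effect described above.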

Now the picture is better. Clearly Brazil and Spain appear to be the nations that produce players with the best mean overall score.

South America and Europe are more yellow compared to other continents. Asia and Africa are more on the blue side.

I subset the data straightaway, because many nations do not have a considerable player count. From the map it is clear once again that South America and Europe tend to be on the higher side of the score. Interestingly, Spain has the highest potential, and not Brazil.

Portugal naturally has the highest potential of 94, as Cristiano Ronaldo already has an overall of 94. Spain's top potential is 92, even though it has the highest mean potential.

Once again I took a subset of the data, considering only countries that have at least 200 players listed in FIFA. Brazil and Chile appear to be at par with their potential. The United Kingdom's huge difference is a shocker. Perhaps it is because of the younger players that it has.

Western countries appear to be in possession of players most likely to improve. Countries in Asia and Africa, which already have most of their players with low Overall score, also have low Potential score.

Once again Europe and South America are doing better than other continents. Let's see where the most valued player is from.

The most valued player is from Brazil.


From the above plot I see meaningful correlations between the following:

  1. Potential and num_Value
  2. Overall and num_Value
  3. Overall and Potential
  4. Age and Overall
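The pairwise correlations behind a list like this can be computed with `cor()`. The sketch below uses made-up values under the same column names; `use = "pairwise.complete.obs"` is chosen here because num_Value contains NAs, and it is an assumption about how missing values should be handled rather than the setting used for this report.

```r
# Made-up values; only the column names mirror the real data.
players <- data.frame(
  Age       = c(19, 22, 25, 28, 31, 34),
  Overall   = c(60, 66, 72, 78, 80, 79),
  Potential = c(78, 80, 82, 82, 80, 79),
  num_Value = c(400, 900, NA, 9000, 7000, 3000)
)

# Pearson correlations, using every pair of rows where both values are present.
cor_matrix <- cor(players, use = "pairwise.complete.obs")

print(round(cor_matrix, 2))
```

Each cell of the matrix corresponds to one pair of variables, so the four relationships listed above are read from the matching row and column.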


Normally one would expect the value of a player to rise with potential, and looking at the correlation it does appear so. However, there are players with a potential above 90 and a value of only 975K. Maybe the player's preferred position has an impact on value.

It is the same story with overall: value does increase with overall score, but there are some players with a high overall score and a low value. I wonder why they are undervalued. I will explore this further in the multivariate plots.

Bivariate Analysis


Talk about some of the relationships you observed in this part of the
investigation. How did the feature(s) of interest vary with other features in
the dataset?

Did you observe any interesting relationships between the other features
(not the main feature(s) of interest)?

What was the strongest relationship you found?

Multivariate Plots Section


Multivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. Were there features that strengthened each other in terms of
looking at your feature(s) of interest?

Were there any interesting or surprising interactions between features?

OPTIONAL: Did you create any models with your dataset? Discuss the
strengths and limitations of your model.


Final Plots and Summary


Plot One

Description One

Plot Two

Description Two

Plot Three

Description Three


Reflection
